Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(cohorts): optimized select from cohort_people #21564

Merged
merged 14 commits into from
Apr 17, 2024
Merged

Conversation

mariusandra
Copy link
Collaborator

@mariusandra mariusandra commented Apr 16, 2024

Problem

The following HogQL doesn't always return what you expect:

select person_id, cohort_id from cohort_people

Behind the scenes it's translated to an operation that groups by person_id, cohort_id and with having(sum(sign)) > 0. This works well if the data is correct, but sometimes cohort updates fail leaving messy data. Here's a customer case where sum(sign) equals -16 or -9 or 0 over all rows, even though the last version has all with a +1 sign.

Changes

In "In Cohort" queries we ignore the "sign" field and just check for the latest "version" field. When selecting from the table itself we used just sum(sign) > 0, ignoring the versions.

We'll now do the same we do with 'in cohort' checks: fetch the version information from Postgres and inline it into the query.

Thus this swaps the query select cohort_id, person_id from cohort_people in the following way.

Before:

select cohort_id, person_id from (
    select cohort_id, person_id 
    from raw_cohort_people
    group by cohort_id, person_id
    having sum(sign) > 0
)

After:

select cohort_id, person_id from (
    select cohort_id, person_id 
    from raw_cohort_people
    where (cohort_id, version) in ((1, 2), (3, 4), (4, 5)) and sign > 0
)

I tried inlining the versions in ClickHouse with a (select cohort_id, max(version) as version from raw_cohort_people group by cohort_id) subquery, but on "team 2", this query alone took ~5 sec. Smaller customer teams ran much faster, but this is obviously slow for larger users.

Does this work well for both Cloud and self-hosted?

Yes

How did you test this code?

WIP

@neilkakkar
Copy link
Collaborator

neilkakkar commented Apr 16, 2024

This is one main reason we moved off of sign and onto version: The collapsing merge tree has shit-resiliency. Ideally nothing uses the cohort sign queries anymore.

I don't think you need to filter on sign either in the new query? If the update is stale (ex: new version has added stuff since you queried for the version from postgres) then you still want to return all data from this version, rather than just the +ve signed data, since the latter may be empty, and the former will be all persons from the previous computation, which is much closer.

And if its the latest version, sign has no meaning anyway

@mariusandra
Copy link
Collaborator Author

mariusandra commented Apr 16, 2024

Thanks for the context!

When I looked around, all previous versions essentially had the same data, but with a -1 sign. I fear without checking for sign > 0, if a version upgrade is taking place while the query is taking place (or some other timing related incident causes us to fetch an older version), we'll suddenly have all person_id-s returned double. To combat this, I should either use a distinct somewhere, or then only check positive versions. Not sure what's faster

@neilkakkar
Copy link
Collaborator

ohh good point! I think you should go with distinct.

The problem with just sign > 0 is that if some rows have collapsed already (+1, -1 sign merged into an empty one), that person_id will be gone, so it can lead to some random people going in and out of the cohort, which will be very hard to debug, and pretty non-deterministic.

@mariusandra mariusandra requested a review from neilkakkar April 16, 2024 13:13
@mariusandra mariusandra marked this pull request as ready for review April 16, 2024 13:13
team_id=self.team.pk,
distinct_ids=["2"],
properties={"$some_prop": "something", "$another_prop": "something2"},
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will like this test more if it also had a third person where $some_prop = "not something"

],
name="cohort1",
)
cohort1.calculate_people_ch(pending_version=0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very optional, to make it even more rock solid, you can

Suggested change
cohort1.calculate_people_ch(pending_version=0)
cohort1.calculate_people_ch(pending_version=0)
cohort1.calculate_people_ch(pending_version=2)
cohort1.calculate_people_ch(pending_version=4)

to simulate a few recalculations

WHERE equals(cohortpeople.team_id, 420)
GROUP BY person_id, cohort_id, cohort_people___person_id
HAVING ifNull(greater(sum(cohortpeople.sign), 0), 0)) AS cohort_people LEFT JOIN (
WHERE and(equals(cohortpeople.team_id, 420), false)) AS cohort_people LEFT JOIN (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

false - I'm assuming because wherever this test is hasn't setup the cohort version?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Precisely. We have never stored any data in the cohort, so no need to query anything. Added an extra test to check it as well.

@@ -111,7 +111,7 @@ export function HogQLDebug({ query, setQuery, queryKey }: HogQLDebugProps): JSX.
<LemonSelect
options={[
{ value: 'auto', label: 'auto' },
{ value: 'leftjoin', label: 'join' },
{ value: 'leftjoin', label: 'leftjoin' },
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flyby fix 🙈

Copy link
Contributor

github-actions bot commented Apr 17, 2024

Size Change: 0 B

Total Size: 999 kB

ℹ️ View Unchanged
Filename Size
frontend/dist/toolbar.js 999 kB

compressed-size-action

@mariusandra mariusandra merged commit ca49427 into master Apr 17, 2024
104 checks passed
@mariusandra mariusandra deleted the cohort-group branch April 17, 2024 11:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants